XML Reconstruction View Selection in XML Databases: 
Complexity Analysis and Approximation Scheme 

Artem Chebotko and Bin Fu 

Department of Computer Science 

University of Texas-Pan American 

Edinburg, TX 78539, USA 

{artem, binfu}@cs.panam.edu

July 19, 2010 



Abstract 

Query evaluation in an XML database requires reconstructing XML subtrees rooted at nodes found
by an XML query. Since XML subtree reconstruction can be expensive, one approach to improve query
response time is to use reconstruction views: materialized XML subtrees of an XML document, whose
nodes are frequently accessed by XML queries. For this approach to be efficient, the principal
requirement is a framework for view selection. In this work, we are the first to formalize and study
the problem of XML reconstruction view selection. The input is a tree $T$, in which every node $i$
has a size $c_i$ and profit $p_i$, and the size limitation $C$. The target is to find a subset of
subtrees rooted at nodes $i_1, \ldots, i_k$ respectively such that $c_{i_1} + \cdots + c_{i_k} \le C$,
and $p_{i_1} + \cdots + p_{i_k}$ is maximal. Furthermore, there is no overlap between any two subtrees
selected in the solution. We prove that this problem is NP-hard and present a fully polynomial-time
approximation scheme (FPTAS) as a solution.

1 Introduction

With XML [1] being the de facto standard for business and Web data representation and exchange, storage
and querying of large XML data collections is recognized as an important and challenging research problem.
A number of XML databases [2, 4, 6-8, 14, 18, 19, 22, 27, 28] have been developed to serve as a solution to
this problem. While XML databases can employ various storage models, such as relational model or native
XML tree model, they support standard XML query languages, called XPath and XQuery. In general,
an XML query specifies which nodes in an XML tree need to be retrieved. Once an XML tree is stored 
into an XML database, a query over this tree usually requires two steps: (1) finding the specified nodes, if 
any, in the XML tree and (2) reconstructing and returning XML subtrees rooted at found nodes as a query 
result. The second step is called XML subtree reconstruction [9, 10] and may have a significant impact on 
query response time. One approach to minimize XML subtree reconstruction time is to cache XML subtrees 
rooted at frequently accessed nodes as illustrated in the following example. 

Figure 1: An example of an XML tree, its relational storage, and XML reconstruction view.
Panels: (a) XML tree, (b) Edge table, (c) XML reconstruction view.

Consider an XML tree in Figure 1(a) that describes a sample bookstore inventory. The tree nodes
correspond to XML elements, e.g., bookstore and book, and data values, e.g., "Arthur" and "Bernstein",
and the edges represent parent-child relationships among nodes, e.g., all the book elements are children of
bookstore. In addition, each element node is assigned a unique identifier that is shown next to the node in
the figure. As an example, in Figure 1(b), we show how this XML tree can be stored into a single table in an
RDBMS using the edge approach [14]. The edge table $r_{edge}$ stores each XML element as a separate tuple
that includes the element ID, ID of its parent, element name, and element data content. A sample query over
this XML tree that retrieves books with title "Database Systems" can be expressed in XPath as:

/bookstore/book[title="Database Systems"]

This query can be translated into relational algebra or SQL over the edge table to retrieve IDs of the book 
elements that satisfy the condition: 

$$\pi_{r_2.ID}\Big( r_1 \bowtie_{\, r_1.ID = r_2.parentID \,\wedge\, r_1.name = \text{'bookstore'} \,\wedge\, r_1.parentID \text{ is NULL} \,\wedge\, r_2.name = \text{'book'}} \; r_2 \bowtie_{\, r_2.ID = r_3.parentID \,\wedge\, r_3.name = \text{'title'} \,\wedge\, r_3.content = \text{'Database Systems'}} \; r_3 \Big)$$

where $r_1$, $r_2$, and $r_3$ are aliases of table $r_{edge}$. For the edge table in Figure 1(b), the relational
algebra query returns ID "2", which uniquely identifies the first book element in the tree. However, to
retrieve useful information about the book, the query evaluator must further retrieve all the descendants
of the book node and reconstruct their parent-child relationships into an XML subtree rooted at this node;
this requires additional self-joins of the edge table and a reconstruction algorithm, such as the one
proposed in [9]. Instead, to avoid expensive XML subtree reconstruction, the subtree can be explicitly
stored in the database as an XML reconstruction view (see Figure 1(c)). This materialized view can be used
for the above XPath query or any other query that needs to reconstruct and return the book node (with
ID "2") or its descendant.
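The two query steps described above can be tried out on a toy edge table. The sketch below is only an illustration under stated assumptions: the rows, the table name `edge`, the element names `author`, `first`, `last`, and the second book are hypothetical (the actual contents of Figure 1(b) are not reproduced here), SQLite stands in for the RDBMS, and the naive recursive `reconstruct` function is not the algorithm of [9].

```python
import sqlite3

# Hypothetical edge-table rows (ID, parentID, name, content);
# these are NOT the actual rows of Figure 1(b).
ROWS = [
    (1, None, "bookstore", None),
    (2, 1, "book", None),
    (3, 2, "title", "Database Systems"),
    (4, 2, "author", None),
    (5, 4, "first", "Arthur"),
    (6, 4, "last", "Bernstein"),
    (7, 1, "book", None),
    (8, 7, "title", "Web Programming"),
]

def build_db():
    """Create an in-memory edge table and load the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE edge (ID INTEGER PRIMARY KEY, parentID INTEGER, "
        "name TEXT, content TEXT)"
    )
    conn.executemany("INSERT INTO edge VALUES (?, ?, ?, ?)", ROWS)
    return conn

def find_books(conn, title):
    """Step 1: find IDs of book elements with the given title (two self-joins)."""
    sql = """
        SELECT r2.ID
        FROM edge r1
        JOIN edge r2 ON r1.ID = r2.parentID
        JOIN edge r3 ON r2.ID = r3.parentID
        WHERE r1.name = 'bookstore' AND r1.parentID IS NULL
          AND r2.name = 'book'
          AND r3.name = 'title' AND r3.content = ?
        ORDER BY r2.ID
    """
    return [row[0] for row in conn.execute(sql, (title,))]

def reconstruct(conn, node_id):
    """Step 2: naive subtree reconstruction, one lookup per node (no XML escaping)."""
    name, content = conn.execute(
        "SELECT name, content FROM edge WHERE ID = ?", (node_id,)
    ).fetchone()
    children = conn.execute(
        "SELECT ID FROM edge WHERE parentID = ? ORDER BY ID", (node_id,)
    ).fetchall()
    inner = (content or "") + "".join(reconstruct(conn, c[0]) for c in children)
    return f"<{name}>{inner}</{name}>"

conn = build_db()
book_ids = find_books(conn, "Database Systems")
subtrees = [reconstruct(conn, i) for i in book_ids]
```

Step 2 issues one or more lookups per descendant, which is the cost a reconstruction view is meant to avoid: storing the string produced by `reconstruct` for a frequently requested node would let later queries skip it.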

In this work, we study the problem of selecting XML reconstruction views to materialize: given a set
of XML elements $D$ from an XML database, their access frequencies $a_i$ (aka workload), a set of
ancestor-descendant relationships $AD$ among these elements, and a storage capacity $\delta$, find a set of
elements $M$ from $D$, whose XML subtrees should be materialized as reconstruction views, such that their
combined size is no larger than $\delta$ and the total access cost is minimized. To the best of our
knowledge, our solution to this problem is the first one proposed in the literature.
Our main contributions and the paper organization are as follows. In Section 2, we discuss related work.
In Section 3, we formally define the XML reconstruction view selection problem. In Sections 4 and 5, we
prove that the problem is NP-hard and describe a fully polynomial-time approximation scheme (FPTAS) for
the problem. In Section 6, we extend the scheme to an input of multiple trees. We conclude the paper and
list future work directions in Section 7.

2 Related Work 

We studied the XML subtree reconstruction problem in the context of a relational storage of XML documents
in [9, 10], where several algorithms have been proposed. Given an XML element returned by an XML query,
our algorithms retrieve all its descendants from a database and reconstruct their relationships into an XML
subtree that is returned as the query result. To the best of our knowledge, there has been no previous work
on materializing reconstruction views or XML reconstruction view selection.

Materialized views [3, 13, 23, 24, 29, 31] have been successfully used for query optimization in XML
databases. These research works rewrite an XML query, such that it can be answered either using only
available materialized views, if possible, or accessing both the database and materialized views. View
maintenance in XML databases has been studied in [25, 26]. There has been only one recent work [30] on
materialized view selection in the context of XML databases. In [30], the problem is defined as: find views
over XML data, given XML databases, storage space, and a set of queries, such that the combined view
size does not exceed the storage space. The proposed solution produces minimal XML views as candidates
for the given query workload, organizes them into a graph, and uses two view selection strategies to choose
views to materialize. This approach makes an assumption that views are used to answer XML queries
completely (not partially) without accessing an underlying XML database. The XML reconstruction view
problem studied in our work focuses on a different aspect of XML query processing: it finds views to
materialize based on how frequently an XML element needs to be reconstructed. However, XML reconstruction
views can be used in a complementary way for query answering, if desired.

Finally, the materialized view selection problem has been extensively studied in data warehouses [5,
11, 16, 17, 20, 32] and distributed databases [21]. These research results are hardly applicable to XML tree
structures and in particular to subtree reconstruction, which is not required for data warehouses or relational
databases.

3 XML Reconstruction View Selection Problem 

In this section, we formally define the XML reconstruction view selection problem addressed in our work.

Problem formulation. Given $n$ XML elements, $D = \{D_1, D_2, \ldots, D_n\}$, and an ancestor-descendant
relationship $AD$ over $D$ such that if $(D_j, D_i) \in AD$, then $D_j$ is an ancestor of $D_i$, let
$COST_R(D_i)$ be the access cost of accessing unmaterialized $D_i$, and let $COST_A(D_i)$ be the access cost
of accessing materialized $D_i$. We have $COST_A(D_i) < COST_R(D_i)$ since reconstruction of $D_i$ takes time.
We use $size(D_i)$ to denote the memory capacity required to store a materialized XML element, with
$size(D_i) > 0$ and $size(D_i) < size(D_j)$ for any $(D_j, D_i) \in AD$. Given a workload that is characterized
by $a_i$ $(i = 1, 2, \ldots, n)$ representing the access frequency of $D_i$, the XML reconstruction view
selection problem is to select a set of elements $M$ from $D$ to be materialized to minimize the total
access cost

$$\tau(D, M) = \sum_{i=1}^{n} a_i \times COST(D_i),$$

under the disk capacity constraint

$$\sum_{D_i \in M} size(D_i) \le \delta,$$

where $COST(D_i) = COST_A(D_i)$ if $D_i \in M$ or for some ancestor $D_j$ of $D_i$, $D_j \in M$; otherwise
$COST(D_i) = COST_R(D_i)$. $\delta$ denotes the available memory capacity, $\delta > 0$.
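As a concrete reading of this definition, the following minimal sketch evaluates the total access cost and the storage used by a candidate set M. The `Element` record, the parent-index encoding of AD, and all numbers are hypothetical choices made only for this example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Element:
    parent: Optional[int]  # index of the parent element, None for the root
    size: int              # size(D_i)
    freq: int              # access frequency a_i
    cost_r: int            # COST_R(D_i): access cost when not materialized
    cost_a: int            # COST_A(D_i): access cost when materialized

def covered(elems, M):
    """Indices of elements that are in M or have an ancestor in M."""
    result = set()
    for i in range(len(elems)):
        j = i
        while j is not None:
            if j in M:
                result.add(i)
                break
            j = elems[j].parent
    return result

def total_access_cost(elems, M):
    """Total access cost: a_i * COST_A for covered elements, a_i * COST_R otherwise."""
    cov = covered(elems, M)
    return sum(
        e.freq * (e.cost_a if i in cov else e.cost_r)
        for i, e in enumerate(elems)
    )

def total_size(elems, M):
    """Storage used: only the roots chosen in M are charged."""
    return sum(elems[i].size for i in M)

# Hypothetical instance: element 0 is the root, 1 and 3 are its children,
# 2 is a child of 1. Sizes shrink from ancestor to descendant.
elems = [
    Element(None, 6, 1, 10, 2),
    Element(0, 3, 5, 6, 1),
    Element(1, 1, 2, 2, 1),
    Element(0, 2, 1, 4, 1),
]
```

With these numbers, materializing nothing costs 48, materializing element 1 alone costs 21 while using 3 units of storage, and materializing the root costs 10 while using 6 units.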

Next, let $\nabla COST(D_i) = COST_R(D_i) - COST_A(D_i)$ denote the cost saving by materialization; then
one can show that function $\tau$ is minimized if and only if the following function $\lambda$ is maximized

$$\lambda(D, M) = \sum_{D_i \in M^+} a_i \times \nabla COST(D_i)$$

under the disk capacity constraint

$$\sum_{D_i \in M} size(D_i) \le \delta,$$

where $M^+$ represents all the materialized XML elements and their descendant elements in $D$. It is
defined as $M^+ = \{D_i \mid D_i \in M \text{ or } \exists D_j.\,(D_j, D_i) \in AD \wedge D_j \in M\}$.
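The equivalence between minimizing the access cost and maximizing the saving follows in one step from the definitions. Writing $\tau$ for the total access cost, $\lambda$ for the total saving, and $M^+$ for the elements covered by $M$, a short derivation (a sketch using only the quantities defined in this section) is:

```latex
\begin{align*}
\tau(D, M)
  &= \sum_{D_i \in M^+} a_i \, COST_A(D_i) + \sum_{D_i \notin M^+} a_i \, COST_R(D_i) \\
  &= \sum_{i=1}^{n} a_i \, COST_R(D_i)
     - \sum_{D_i \in M^+} a_i \bigl( COST_R(D_i) - COST_A(D_i) \bigr) \\
  &= \sum_{i=1}^{n} a_i \, COST_R(D_i) - \lambda(D, M).
\end{align*}
```

The first term does not depend on $M$, so under the same capacity constraint any $M$ that maximizes $\lambda$ also minimizes $\tau$, and vice versa.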

4 NP-Completeness

In this section, we prove that the XML reconstruction view selection problem is NP-hard. First, the maxi- 
mization problem is changed into the equivalent decision problem. 

Equivalent decision problem. Given $D$, $AD$, $size(D_i)$, $a_i$, $\nabla COST(D_i)$ and $\delta$ as defined
in Section 3, let $K$ denote the cost saving goal, $K > 0$. Is there a subset $M \subseteq D$ such that

$$\sum_{D_i \in M^+} a_i \times \nabla COST(D_i) \ge K$$

and

$$\sum_{D_i \in M} size(D_i) \le \delta\,?$$

$M^+$ represents all the materialized XML elements and their descendant elements in $D$. It is defined as
$M^+ = \{D_i \mid D_i \in M \text{ or } \exists D_j.\,(D_j, D_i) \in AD \wedge D_j \in M\}$.

In order to study this problem in a convenient model, we use the following simplified version.

The input is a tree $T$, in which every node $i$ has a size $c_i$ and profit $p_i$, and the size limitation
$C$. The target is to find a subset of subtrees rooted at nodes $i_1, \ldots, i_k$ such that
$c_{i_1} + \cdots + c_{i_k} \le C$, and $p_{i_1} + \cdots + p_{i_k}$ is maximal. Furthermore, there is no
overlap between any two subtrees selected in the solution.
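For very small trees the simplified version can be solved exactly by enumeration, which is useful as a reference when checking an approximation. The sketch below is an exponential-time illustration, not part of the method of this paper; the dictionary-based tree encoding and the sample numbers are hypothetical.

```python
from itertools import product

def selections(children, v):
    """All sets of pairwise non-overlapping subtree roots inside the subtree of v.

    Either v itself is chosen (covering its whole subtree), or each child's
    subtree contributes one of its own selections, possibly the empty one.
    """
    result = [(v,)]
    per_child = [selections(children, c) for c in children[v]]
    for combo in product(*per_child):
        result.append(tuple(x for part in combo for x in part))
    return result

def best_selection(children, size, profit, root, capacity):
    """Exact optimum by brute force; only suitable for tiny trees."""
    feasible = [
        s for s in selections(children, root)
        if sum(size[i] for i in s) <= capacity
    ]
    return max(feasible, key=lambda s: sum(profit[i] for i in s))

# Hypothetical instance: node 0 is the root with children 1 and 2;
# node 1 has the leaf children 3 and 4.
children = {0: [1, 2], 1: [3, 4], 2: [], 3: [], 4: []}
size = {0: 10, 1: 5, 2: 4, 3: 2, 4: 2}
profit = {0: 12, 1: 7, 2: 5, 3: 4, 4: 2}

best = best_selection(children, size, profit, 0, capacity=6)
best_profit = sum(profit[i] for i in best)
```

With capacity 6, the best choice in this instance is the pair of subtrees rooted at nodes 3 and 2, with total size 6 and total profit 9. Choosing node 1 alone fits as well but only yields profit 7.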

We prove that the decision problem of the XML reconstruction view selection is NP-hard by constructing a
polynomial-time reduction from KNAPSACK [15] to it.

Theorem 4.1 The decision problem of the XML reconstruction view selection is NP-complete. 

Proof: It is straightforward to verify that the problem is in NP. For hardness, we reduce the well-known
NP-complete problem KNAPSACK [15] to a restricted class of instances of our problem, constructed as follows.

Assume that a Knapsack problem has input $(p_1, c_1), \ldots, (p_n, c_n)$, and parameters $K$ and $C$. We need
to determine a subset $S \subseteq \{1, \ldots, n\}$ such that $\sum_{i \in S} c_i \le C$ and
$\sum_{i \in S} p_i \ge K$.

Build a binary tree $T$ with exactly $n$ leaves. Let leaf $i$ have profit $p_i$ and size $c_i$. Furthermore,
each internal node, which is not a leaf, has size $\infty$ and profit $\infty$.

Clearly, no solution can contain an internal node due to the size limitation. We can only select a subset
of leaves. This is equivalent to the Knapsack problem.

□
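The construction in the proof can be written down directly. In the sketch below, `math.inf` stands in for the infinite size and profit of internal nodes, the balanced shape of the binary tree is an arbitrary choice, and both brute-force checkers are hypothetical helpers added only to compare the two optima on a small example.

```python
import math
from itertools import combinations, product

def knapsack_to_tree(items):
    """items: non-empty list of (profit, size). Returns (children, size, profit, root)."""
    children, size, profit = {}, {}, {}

    def build(lo, hi):
        v = len(children)          # next free node id
        children[v] = []
        if hi - lo == 1:           # leaf: carries one Knapsack item
            profit[v], size[v] = items[lo]
        else:                      # internal node: can never fit in the capacity
            size[v] = profit[v] = math.inf
            mid = (lo + hi) // 2
            children[v] = [build(lo, mid), build(mid, hi)]
        return v

    root = build(0, len(items))
    return children, size, profit, root

def selections(children, v):
    """All sets of pairwise non-overlapping subtree roots inside the subtree of v."""
    result = [(v,)]
    for combo in product(*(selections(children, c) for c in children[v])):
        result.append(tuple(x for part in combo for x in part))
    return result

def tree_optimum(items, capacity):
    """Best profit of the tree instance, by enumeration."""
    children, size, profit, root = knapsack_to_tree(items)
    feasible = [
        s for s in selections(children, root)
        if sum(size[i] for i in s) <= capacity
    ]
    return max(sum(profit[i] for i in s) for s in feasible)

def knapsack_optimum(items, capacity):
    """Best profit of the Knapsack instance, by enumeration of subsets."""
    best = 0
    for r in range(len(items) + 1):
        for subset in combinations(items, r):
            if sum(c for _, c in subset) <= capacity:
                best = max(best, sum(p for p, _ in subset))
    return best

items = [(6, 2), (10, 4), (12, 6), (3, 1)]  # hypothetical (profit, size) pairs
```

For these four items and capacity 7, both optima are 19 (the items of sizes 2, 4, and 1), which matches the claim that only leaves can be selected in the tree instance.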

Finally, we state the NP-hardness of the XML reconstruction view selection problem. 

Theorem 4.2 The XML reconstruction view selection problem is NP-hard. 

Proof: It follows from Theorem 4.1, since the equivalent decision problem is NP-complete. □ 

5 Fully Polynomial-Time Approximation Scheme 

We assume that each parameter is an integer. The input is $n$ XML elements, $D = \{D_1, D_2, \ldots, D_n\}$,
which will be represented by an AD tree $J$, where each edge in $J$ shows a relationship between a pair of
parent and child nodes.

We use a divide and conquer approach to develop a fully polynomial-time approximation scheme. Given an
AD tree $J$ with root $r$, it has subtrees $J_1, \ldots, J_k$ derived from the children $r_1, \ldots, r_k$
of $r$. We find a set of approximate solutions among $J_1, \ldots, J_{k/2}$ and another set of approximate
solutions among $J_{k/2+1}, \ldots, J_k$.

We merge the two sets of approximate solutions to obtain the solutions for the union of subtrees
$J_1, \ldots, J_k$. We add one more solution that is derived by selecting the root $r$ of $J$. We group those
solutions into parts such that each part contains all solutions $P$ with similar $\lambda(D, P)$. We prune
those solutions by selecting the one from each part with the least size. This bounds the number of
solutions by a polynomial.

We will use a list $P$ to represent the selection of elements from $D$.

For a list of elements $P$, define $\lambda(P) = \sum_{D_i \in P} a_i \times \nabla COST(D_i)$, and
$\mu(P) = \sum_{D_i \in P} size(D_i)$. Define $\chi(J)$ to be the largest product of the node degrees along
a path from root to a leaf in the AD tree $J$.

Assume that $\epsilon$ is a small constant with $1 > \epsilon > 0$. We need a $(1 + \epsilon)$
approximation. We maintain a list of solutions $P_1, P_2, \ldots$, where $P_i$ is a list of elements in $D$.

Let $f = (1 + \frac{\epsilon}{z})$ with $z = 2 \log \chi(J)$. Let $w = \sum_{i=1}^{n} a_i COST_R(D_i)$ and
$s = \sum_{i=1}^{n} size(D_i)$.

Partition the interval $[0, w]$ into $I_1, I_2, \ldots, I_{t_1}$ such that $I_1 = [0, 1]$ and
$I_k = (b_{k-1}, b_k]$ with $b_k = f \cdot b_{k-1}$ for $k < t_1$, and $I_{t_1} = (b_{t_1 - 1}, w]$, where
$b_{t_1 - 1} < w \le f\, b_{t_1 - 1}$.

Two lists $P_i$ and $P_j$ are in the same region if there exists $I_k$ such that both $\lambda(P_i)$ and
$\lambda(P_j)$ are in $I_k$.

For two lists of partial solutions $P_i = D_{i_1} \cdots D_{i_{m_1}}$ and $P_j = D_{j_1} \cdots D_{j_{m_2}}$,
their link is $P_i \circ P_j = D_{i_1} \cdots D_{i_{m_1}} D_{j_1} \cdots D_{j_{m_2}}$.
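The geometric partition can be made concrete with a few lines of code. The sketch below assumes that the first breakpoint is 1 (as the first interval [0, 1] suggests) and that the logarithm is taken base 2 (as the identity used later in the proof of Lemma 5.1 suggests); the function names are hypothetical, and the truncation of the last interval at w is ignored because it does not change which interval a value falls into.

```python
import math

def growth_factor(eps, chi):
    """f = 1 + eps / z with z = 2 * log2(chi); assumes chi >= 2."""
    return 1 + eps / (2 * math.log2(chi))

def region_index(value, f):
    """Index k of the interval containing value.

    Interval 1 is [0, 1]; interval k >= 2 is (f**(k-2), f**(k-1)],
    assuming the first breakpoint is 1.
    """
    k, b = 1, 1.0
    while value > b:
        b *= f
        k += 1
    return k

f = growth_factor(0.5, 4)  # eps = 0.5 and chi = 4 give f = 1.125
```

Two saving values share a region exactly when `region_index` returns the same number for both. For example, with f = 1.125 the values 0 and 1 fall into interval 1, the value 1.1 into interval 2, and the value 1.2 into interval 3. Since the breakpoints grow geometrically, the number of intervals needed to cover [0, w] grows only logarithmically in w.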

Prune($L$)

Input: $L$ is a list of partial solutions $P_1, P_2, \ldots, P_m$;

    Partition $L$ into parts $U_1, \ldots, U_v$ such that two lists $P_i$ and $P_j$ are in the same part
    if $P_i$ and $P_j$ are in the same region.

    For each $U_i$, select $P_j$ such that $\mu(P_j)$ is the least among all $P_j$ in $U_i$;
End of Prune

Merge($L_1$, $L_2$)

Input: $L_1$ and $L_2$ are two lists of solutions.

    Let $L = \emptyset$;

    For each $P_i \in L_1$ and each $P_j \in L_2$
        append their link $P_i \circ P_j$ to $L$;

    Return $L$;
End of Merge

Union($L_1, L_2, \ldots, L_k$)

Input: $L_1, \ldots, L_k$ are lists of solutions.

    If $k = 1$ then return $L_1$;

    Return Prune(Merge(Union($L_1, \ldots, L_{k/2}$),
                       Union($L_{k/2+1}, \ldots, L_k$)));
End of Union

Sketch($J$)

Input: $J$ is a set of $n$ elements according to their AD.

    If $J$ only contains one element $D_i$, return the list $L = D_i, \emptyset$ with two solutions.

    Partition the list $J$ into subtrees $J_1, \ldots, J_k$ according to its $k$ children.

    Let $L_0$ be the list that only contains solution $J$.

    for $i = 1$ to $k$ let $L_i$ = Sketch($J_i$);

    Return Union($L_0, L_1, \ldots, L_k$);
End of Sketch
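The procedures above can be prototyped in a few dozen lines. The sketch below is an illustration under several assumptions rather than a faithful implementation: it works on the simplified model of Section 4 (each node carries one size and one profit for its whole subtree), it adds the root as one extra candidate after the children's lists are combined (following the prose description, so that the root is never linked with its own descendants), it skips tree normalization, it applies the capacity only in the final selection, and it reads log as base 2. All names and the sample instance are hypothetical. A solution is a tuple (profit, size, roots).

```python
import math

def region(value, f):
    """Interval index: 1 for [0, 1], then geometric intervals with ratio f."""
    k, b = 1, 1.0
    while value > b:
        b *= f
        k += 1
    return k

def prune(solutions, f):
    """Keep, for every region, one solution of least size."""
    best = {}
    for sol in solutions:
        k = region(sol[0], f)
        if k not in best or sol[1] < best[k][1]:
            best[k] = sol
    return list(best.values())

def merge(l1, l2):
    """Link every solution of l1 with every solution of l2."""
    return [(p1 + p2, s1 + s2, r1 + r2) for p1, s1, r1 in l1 for p2, s2, r2 in l2]

def union(lists, f):
    """Divide and conquer over the lists of the children."""
    if len(lists) == 1:
        return lists[0]
    mid = len(lists) // 2
    return prune(merge(union(lists[:mid], f), union(lists[mid:], f)), f)

def sketch(children, size, profit, v, f):
    """Pruned list of candidate solutions for the subtree rooted at v."""
    own = (profit[v], size[v], (v,))
    if not children[v]:
        return [own, (0, 0, ())]          # the leaf itself, or nothing
    below = union([sketch(children, size, profit, c, f) for c in children[v]], f)
    return prune(below + [own], f)

def chi(children, v):
    """Largest product of child counts along a path from v to a leaf."""
    kids = children[v]
    return len(kids) * max(chi(children, c) for c in kids) if kids else 1

def approximate(children, size, profit, root, capacity, eps):
    """Most profitable kept solution that fits into the capacity."""
    x = max(chi(children, root), 2)       # guard: log2(1) would give z = 0
    f = 1 + eps / (2 * math.log2(x))
    candidates = sketch(children, size, profit, root, f)
    return max((s for s in candidates if s[1] <= capacity), key=lambda s: s[0])

children = {0: [1, 2], 1: [3, 4], 2: [], 3: [], 4: []}
size = {0: 10, 1: 5, 2: 4, 3: 2, 4: 2}
profit = {0: 12, 1: 7, 2: 5, 3: 4, 4: 2}
result = approximate(children, size, profit, 0, capacity=6, eps=0.5)
```

On this small instance the sketch returns the subtrees rooted at nodes 3 and 2 with profit 9 and size 6. One caveat of the partition as stated: profits 0 and 1 share the first interval, so a solution with profit 1 can be replaced by the empty solution during pruning.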

For a list of elements $P$ and an AD tree $J$, $P[J]$ is the list of elements in both $P$ and $J$. If
$J_1, \ldots, J_k$ are disjoint subtrees of an AD tree, $P[J_1, \ldots, J_k]$ is
$P[J_1] \circ \cdots \circ P[J_k]$.

For convenience, we normalize the tree $J$ by adding some useless nodes $D_i$ with
$COST_R(D_i) = COST_A(D_i) = 0$. The size of the tree is at most doubled after normalization. In the rest
of the section, we always assume $J$ is normalized.

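Lemma 5.1 Assume that $L_i$ is a list of solutions for the problem with AD tree $J_i$ for $i = 1, \ldots, k$.
Let $P_i \in L_i$ for $i = 1, \ldots, k$. Then there exists $P^* \in L$ = Union($L_1, \ldots, L_k$) such that
$\lambda(P^*) \le f^{\log k} \cdot \lambda(P_1 \circ \cdots \circ P_k)$ and
$\mu(P^*) \le \mu(P_1 \circ \cdots \circ P_k)$.

Proof: We prove by induction. It is trivial when $k = 1$. Assume that the lemma is true for cases less than
$k$.

Let $M_1$ = Union($L_1, \ldots, L_{k/2}$) and $M_2$ = Union($L_{k/2+1}, \ldots, L_k$).

Assume that $M_1$ contains $Q_1$ such that
$\lambda(Q_1) \le f^{\log(k/2)} \lambda(P_1 \circ \cdots \circ P_{k/2})$ and
$\mu(Q_1) \le \mu(P_1 \circ \cdots \circ P_{k/2})$.

Assume that $M_2$ contains $Q_2$ such that
$\lambda(Q_2) \le f^{\log(k/2)} \lambda(P_{k/2+1} \circ \cdots \circ P_k)$ and
$\mu(Q_2) \le \mu(P_{k/2+1} \circ \cdots \circ P_k)$.

Let $Q = Q_1 \circ Q_2$. Let $Q^*$ be the solution that is in the same region as $Q$ and has the least
$\mu(Q^*)$. Therefore,

$$\begin{aligned}
\lambda(Q^*) &\le f \lambda(Q) \\
             &\le f^{\log(k/2)} f \,\lambda(P_1 \circ \cdots \circ P_k) \\
             &\le f^{\log k} \lambda(P_1 \circ \cdots \circ P_k).
\end{aligned}$$

Since $\mu(Q_1) \le \mu(P_1 \circ \cdots \circ P_{k/2})$ and
$\mu(Q_2) \le \mu(P_{k/2+1} \circ \cdots \circ P_k)$, we also have

$$\begin{aligned}
\mu(Q^*) &\le \mu(Q_1) + \mu(Q_2) \\
         &\le \mu(P_1 \circ \cdots \circ P_{k/2}) + \mu(P_{k/2+1} \circ \cdots \circ P_k) \\
         &= \mu(P_1 \circ \cdots \circ P_k).
\end{aligned}$$

□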



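Lemma 5.2 Assume that $P$ is an arbitrary solution for the problem with AD tree $J$. For $L$ = Sketch($J$),
there exists a solution $P'$ in the list $L$ such that $\lambda(P') \le f^{\log \chi(J)} \cdot \lambda(P)$
and $\mu(P') \le \mu(P)$.

Proof: We prove by induction. The basis at $|J| \le 1$ is trivial. We assume that the claim is true for all
$|J| < m$. Now assume that $|J| = m$ and $J$ has $k$ children which induce $k$ subtrees $J_1, \ldots, J_k$.

Let $L_i$ = Sketch($J_i$) for $i = 1, \ldots, k$. By our hypothesis, for each $i$ with $1 \le i \le k$, there
exists $Q_i \in L_i$ such that $\lambda(Q_i) \le f^{\log \chi(J_i)} \cdot \lambda(P[J_i])$ and
$\mu(Q_i) \le \mu(P[J_i])$.

Let $M$ = Union($L_1, \ldots, L_k$). By Lemma 5.1, there exists $P' \in M$ such that

$$\begin{aligned}
\lambda(P') &\le f^{\log k} \lambda(Q_1 \circ \cdots \circ Q_k) \\
            &\le f^{\log k} f^{\max\{\log \chi(J_1), \ldots, \log \chi(J_k)\}} \lambda(P[J_1, \ldots, J_k]) \\
            &\le f^{\log \chi(J)} \lambda(P[J_1, \ldots, J_k]) \\
            &= f^{\log \chi(J)} \lambda(P)
\end{aligned}$$

and

$$\begin{aligned}
\mu(P') &\le \mu(Q_1 \circ \cdots \circ Q_k) \\
        &\le \mu(P[J_1, \ldots, J_k]) \\
        &= \mu(P).
\end{aligned}$$

□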



Lemma 5.3 Assume that $\mu(D, J) \le a(n)$. Then the computational time for Sketch($J$) is
$O\left(|J| \left(\frac{(\log \chi(J))(\log a(n))}{\epsilon}\right)^2\right)$, where $|J|$ is the number of
nodes in $J$.

Proof: The number of intervals is $O\left(\frac{(\log \chi(J))(\log a(n))}{\epsilon}\right)$. Therefore the
list of each $L_i$ = Sketch($J_i$) is of length
$O\left(\frac{(\log \chi(J))(\log a(n))}{\epsilon}\right)$.

Let $F(k)$ be the time for Union($L_1, \ldots, L_k$). It satisfies the recursion
$F(k) = 2F(k/2) + O\left(\left(\frac{(\log \chi(J))(\log a(n))}{\epsilon}\right)^2\right)$. This brings the
solution $F(k) = O\left(k \left(\frac{(\log \chi(J))(\log a(n))}{\epsilon}\right)^2\right)$.

Let $T(J)$ be the computational time for Sketch($J$).

Denote by $E(J)$ the number of edges in $J$.

We prove by induction that
$T(J) \le c\,E(J) \left(\frac{(\log \chi(J))(\log a(n))}{\epsilon}\right)^2$ for some constant $c > 0$. We
select the constant $c$ large enough so that merging two lists takes $c\,(n \log a(n))^2$ steps.

We have that

$$\begin{aligned}
T(J) &\le T(J_1) + \cdots + T(J_k) + F(k) \\
     &\le c\,E(J_1) \left(\frac{(\log \chi(J))(\log a(n))}{\epsilon}\right)^2 + \cdots
        + c\,E(J_k) \left(\frac{(\log \chi(J))(\log a(n))}{\epsilon}\right)^2
        + c\,k \left(\frac{(\log \chi(J))(\log a(n))}{\epsilon}\right)^2 \\
     &\le c\,E(J) \left(\frac{(\log \chi(J))(\log a(n))}{\epsilon}\right)^2 \\
     &\le c\,|J| \left(\frac{(\log \chi(J))(\log a(n))}{\epsilon}\right)^2.
\end{aligned}$$

□

We now complete the common procedure. Before we give the FPTAS for our problem, we first give the
following lemma, which facilitates our proof for the FPTAS. One can refer to the algorithm book [12] for
its proof.

Lemma 5.4 (1) For $0 \le x \le 1$, $e^x \le 1 + x + x^2$.
(2) For real $y \ge 1$, $(1 + \frac{1}{y})^y \le e$.

Algorithm

Approximate($J$, $\epsilon$)

Input: $J$ is an AD tree with elements $D_1, \ldots, D_n$ and $\epsilon$ is a small constant with
$1 > \epsilon > 0$;

    Let $L$ = Sketch($J$);

    Select $P_i$ from the list $L$ such that $P_i$ has the optimal cost;
End of the Algorithm

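Theorem 5.5 For any instance of $J$ of an AD tree with $n$ elements, there exists an
$O\left(n \left(\frac{(\log \chi(J))(\log a(n))}{\epsilon}\right)^2\right)$ time approximation scheme, where
$\sum_{i=1}^{n} a_i COST_R(D_i) \le a(n)$.

Proof: Assume that $P$ is the optimal solution for input $J$. Let $L$ = Sketch($J$). By Lemma 5.2, we have
$P^* \in L$ that satisfies the condition of Lemma 5.2.

$$\begin{aligned}
\lambda(P^*) &\le f^{\log \chi(J)} \lambda(P) \\
             &= \left(1 + \frac{\epsilon}{z}\right)^{\log \chi(J)} \lambda(P) \\
             &\le e^{\frac{\epsilon}{2}} \cdot \lambda(P) \\
             &\le \left(1 + \frac{\epsilon}{2} + \left(\frac{\epsilon}{2}\right)^2\right) \cdot \lambda(P) \\
             &< (1 + \epsilon) \cdot \lambda(P).
\end{aligned}$$

Furthermore, $\mu(P^*) \le \mu(P)$.

The computational time follows from Lemma 5.3. □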
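The middle steps of this chain rely on Lemma 5.4 and on the choice $z = 2 \log \chi(J)$. A worked version, written as a sketch under the assumptions that logarithms are base 2 and that $\chi(J) \ge 2$ (so that $y = z/\epsilon \ge 1$ and Lemma 5.4(2) applies):

```latex
\begin{align*}
\left(1 + \frac{\epsilon}{z}\right)^{\log \chi(J)}
  &= \left( \left(1 + \frac{1}{y}\right)^{y} \right)^{\frac{\log \chi(J)}{y}}
     && \text{with } y = \frac{z}{\epsilon} \\
  &\le e^{\frac{\epsilon \log \chi(J)}{z}} = e^{\frac{\epsilon}{2}}
     && \text{by Lemma 5.4(2) and } z = 2 \log \chi(J) \\
  &\le 1 + \frac{\epsilon}{2} + \frac{\epsilon^{2}}{4}
     && \text{by Lemma 5.4(1) with } x = \frac{\epsilon}{2} \\
  &< 1 + \epsilon
     && \text{since } \frac{\epsilon^{2}}{4} < \frac{\epsilon}{2} \text{ for } 0 < \epsilon < 1 .
\end{align*}
```

For instance, with $\epsilon = 0.5$ the third line evaluates to $1 + 0.25 + 0.0625 = 1.3125$, which is below $1 + \epsilon = 1.5$.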

It is easy to see that $\chi(J) \le 2^{|J|}$. We have the following corollary.

Corollary 5.6 For any instance of $J$ of an AD tree with $n$ elements, there exists an
$O\left(n^3 \left(\frac{\log a(n)}{\epsilon}\right)^2\right)$ time approximation scheme, where
$\sum_{i=1}^{n} a_i COST_R(D_i) \le a(n)$.
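The bound on $\chi(J)$ and the resulting running time can be spelled out as follows. This is a sketch that reads the degree of a node as its number of children and uses $|J| = O(n)$ after normalization; $d_1, \ldots, d_m$ denote the degrees along an arbitrary root-to-leaf path.

```latex
\begin{align*}
\prod_{j=1}^{m} d_j \;\le\; \prod_{j=1}^{m} 2^{d_j} \;=\; 2^{\sum_{j=1}^{m} d_j} \;\le\; 2^{|J|}
  \quad &\Longrightarrow \quad \log \chi(J) \le |J| = O(n), \\
O\!\left( n \left( \frac{(\log \chi(J)) (\log a(n))}{\epsilon} \right)^{2} \right)
  \;&\subseteq\; O\!\left( n \left( \frac{n \log a(n)}{\epsilon} \right)^{2} \right)
  \;=\; O\!\left( n^{3} \left( \frac{\log a(n)}{\epsilon} \right)^{2} \right).
\end{align*}
```

The second inequality in the first line holds because the children of distinct nodes on a path are distinct nodes of $J$, so the degrees along a path sum to less than $|J|$.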

6 Extension to Multiple Trees 

In this section, we show an approximation scheme for the problem with an input of multiple trees. The input
is a series of trees $J_1, \ldots, J_k$.

Theorem 6.1 For any instance of $J$ of an AD tree with $n$ elements, there exists an
$O\left(n \left(\frac{(\log \chi(J_0))(\log a_0(n))}{\epsilon}\right)^2\right)$ time approximation scheme,
where $\sum_{j} (p_j + c_j) \le a_0(n)$ and $J_0$ is a tree obtained by connecting all $J_1, \ldots, J_k$ into
a single tree under a common root $r_0$.

Proof: Build a new tree with a new node $r_0$ such that $J_1, \ldots, J_k$ are the subtrees under $r_0$.

Apply the algorithm in Theorem 5.5. □
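The construction in the proof amounts to adding one node. The sketch below uses a hypothetical dictionary encoding of a forest and makes one assumption that the text leaves open: the new root is given a size larger than the capacity and profit 0, so that it can never be selected and the joined instance has the same feasible solutions as the forest.

```python
def join_under_common_root(children, size, profit, roots, capacity):
    """Connect the trees rooted at `roots` under a fresh node r0.

    Assumption (not stated in the text): r0 gets size capacity + 1 and
    profit 0, so no feasible solution can contain it.
    """
    r0 = max(children) + 1                  # fresh integer id for the new root
    joined_children = dict(children)
    joined_children[r0] = list(roots)
    joined_size = dict(size)
    joined_size[r0] = capacity + 1
    joined_profit = dict(profit)
    joined_profit[r0] = 0
    return joined_children, joined_size, joined_profit, r0

# Hypothetical forest with two trees rooted at nodes 0 and 2.
forest_children = {0: [1], 1: [], 2: [3, 4], 3: [], 4: []}
forest_size = {0: 4, 1: 2, 2: 5, 3: 2, 4: 1}
forest_profit = {0: 6, 1: 3, 2: 8, 3: 4, 4: 2}

tree_children, tree_size, tree_profit, r0 = join_under_common_root(
    forest_children, forest_size, forest_profit, roots=[0, 2], capacity=6
)
```

The joined tree can then be passed to any single-tree procedure. The input dictionaries are copied rather than modified, so the original forest stays available.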

7 Conclusion and Future Work 

In this work, we studied the problem of XML reconstruction view selection that promises to improve query
evaluation in XML databases. We were first to formally define this problem: given a set of XML elements
$D$ from an XML database, their access frequencies $a_i$ (aka workload), a set of ancestor-descendant
relationships $AD$ among these elements, and a storage capacity $\delta$, find a set of elements $M$ from
$D$, whose XML subtrees should be materialized as reconstruction views, such that their combined size is
no larger than $\delta$. Next, we showed that the XML reconstruction view selection problem is NP-hard.
Finally, we proposed a fully polynomial-time approximation scheme (FPTAS) that can be used to solve the
problem in practice.

Future work for our research includes two main directions: (1) an extension of the proposed solution 
to support multiple XML trees and (2) an implementation and performance study of our framework in an 
existing XML database.